Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss
We consider online learning for episodic stochastically constrained Markov decision processes (CMDPs), which plays a central role in ensuring the safety of reinforcement learning. Here the loss function can vary arbitrarily across the episodes, whereas both the loss received and the budget consumption are revealed at the end of each episode. Previous works solve this problem under the restrictive assumption that the transition model of the MDP is known a priori and establish regret bounds that depend polynomially on the cardinalities of the state space $\mathcal{S}$ and the action space $\mathcal{A}$. In this work, we propose a new \emph{upper confidence primal-dual} algorithm, which only requires the trajectories sampled from the transition model. In particular, we prove that the proposed algorithm achieves $\widetilde{\mathcal{O}}(L|\mathcal{S}|\sqrt{|\mathcal{A}|T})$ upper bounds on both the regret and the constraint violation, where $L$ is the length of each episode. Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning, which demonstrates the power of ``optimism in the face of uncertainty'' in constrained online learning.
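To make the primal-dual mechanics concrete, below is a minimal tabular sketch, with hypothetical names and shapes throughout: it plans optimistically on a bonus-adjusted Lagrangian that combines the episode's loss with the constraint cost weighted by a Lagrange multiplier, and then updates that multiplier as a virtual queue. This is an illustrative sketch under simplifying assumptions, not the paper's actual occupancy-measure-based algorithm.

```python
import numpy as np

# Illustrative sketch only: the paper's algorithm operates on occupancy
# measures with a transition confidence set; here we keep the two key
# ingredients in their simplest tabular form. Shapes are assumptions:
# loss, cost, counts, bonus are (S, A); p_hat is (S, A, S).

def exploration_bonus(counts, delta=0.01):
    """Hoeffding-style count-based bonus (an assumed, simplified form)."""
    return np.sqrt(np.log(1.0 / delta) / np.maximum(counts, 1))

def optimistic_plan(loss, cost, q_dual, p_hat, bonus, horizon):
    """Backward induction on the bonus-adjusted Lagrangian loss + q_dual * cost.

    Subtracting the bonus makes rarely visited state-action pairs look
    attractive -- the "optimism in the face of uncertainty" ingredient.
    """
    num_states, _ = loss.shape
    v = np.zeros(num_states)                     # terminal values are zero
    lagrangian = loss + q_dual * cost            # primal objective for the episode
    policy = []
    for _ in range(horizon):
        q_vals = lagrangian - bonus + p_hat @ v  # optimistic action values, (S, A)
        policy.append(q_vals.argmin(axis=1))     # greedy w.r.t. optimistic values
        v = q_vals.min(axis=1)
    return policy[::-1]                          # decision rule for step 0 first

def dual_update(q_dual, episode_cost, budget):
    """Virtual-queue update of the Lagrange multiplier; the high-probability
    drift analysis bounds how large this process can grow, which in turn
    bounds the constraint violation."""
    return max(q_dual + episode_cost - budget, 0.0)
```

In each episode, one would roll out the greedy policy, observe the realized loss and constraint cost, refresh the visit counts and the empirical model p_hat, and then call dual_update with the episode's total constraint cost.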
Review for NeurIPS paper: Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss
Weaknesses: (W1): The high-level outline of the proof strategy follows previous procedures for the drift analysis in (Yu et al. 2017) and the MDP analysis in (Neu et al. 2012) and (Rosenberg et al. 2019). Lemma B.2 is very similar to Lemma 4 in Neu et al. 2012 and Lemma B.2 in Rosenberg et al. 2019. Lemma 5.2 mirrors Lemma 8 in Yu et al. 2017. The technical lemmas for the stochastic analysis are also taken from prior work: Lemmas B.6 and B.7 are Lemmas 5 and 9 in Yu et al. 2017. The main lemma, Lemma 5.3, has the same goal as Lemma 7 in Yu et al. 2017, namely to show that Q_t satisfies the drift condition stated in Lemma 5 of Yu et al. 2017. Lemma 5.6 is identical to Lemma 3 in Yu et al. 2017.
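For context, the drift condition in question is, schematically, a Hajek-type condition on the multiplier process: bounded increments together with negative expected drift above a threshold (stated here in its generic form; the exact constants are specific to the cited papers):
$$|Q_{t+1} - Q_t| \le \nu_{\max}, \qquad \mathbb{E}\big[\, Q_{t+1} - Q_t \mid Q_t \ge \theta \,\big] \le -\zeta \quad \text{for some } \theta, \zeta, \nu_{\max} > 0,$$
which yields a high-probability bound on Q_t and, in turn, on the cumulative constraint violation.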
Review for NeurIPS paper: Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss
I want to thank the authors for preparing the detailed rebuttal. This paper was discussed among all the reviewers during the post-rebuttal discussion phase. Overall, the reviewers are excited about this work on solving constrained MDP problems and have a positive assessment of the paper. All the reviewers acknowledged the theoretical contributions, especially in a challenging setting with unknown dynamics and non-stationary loss functions. There was a clear consensus that the paper should be accepted.